Assess the performance of the test

pacman::p_load(outbreaker2, ape, igraph, vegan, distcrete, tidyverse, scales)
pacman::p_load_gh("CyGei/o2ools")
devtools::load_all()

Type I error (False Positive)

A Type I error, or false positive, occurs when the null hypothesis is rejected despite being true. In this context, it means concluding that two (or more) posterior chains differ when they are, in fact, identical.

We run outbreaker2 to produce a single posterior chain. We then create two (or more) bootstrapped samples by sampling with replacement from the same posterior chain and apply a statistical test to compare them. The test should yield a non-significant result (p-value > 0.05) to correctly suggest no difference.

Running outbreaker2

set.seed(111)
x <- outbreaker(
  data = outbreaker_data(
    dna = fake_outbreak$dna,
    dates = fake_outbreak$sample,
    ctd = fake_outbreak$ctd,
    w_dens = fake_outbreak$w,
    ids = as.character(1:30)
  ),
  config = create_config(
    init_pi = 1,
    move_pi = FALSE,
    find_import = FALSE
  )
)
x <- x[x$step > 1000, ] # burnin

chain <- o2ools::get_trees(out = x, ids = as.character(1:30))


# Function to remove the introduction (e.g. NA -> 1) from each tree in the chain
prep_chain <- function(chain) {
  clean <- lapply(chain, function(x) {
    total_nas <- sum(is.na(x))
    from_nas <- sum(is.na(x$from))
    if (total_nas == 1 && from_nas == 1) {
      x <- x[!is.na(x$from), ]
      x[] <- lapply(x, as.character)
      return(x)
    } else {
      stop("Each tree's data frame must have exactly one NA, and it must be in the 'from' column")
    }
  })
  return(clean)
}

chain <- prep_chain(chain)
invisible(lapply(chain, check_tree)) # checks that the trees are in the right format

Chi-square test

The chi-square test is performed 1000 times, with new samples drawn for each iteration. Under the null hypothesis, we expect around 5% of the tests to yield a p-value < 0.05 due to random chance.

For example:

set.seed(111)
lambda <- 20
t <- replicate(1000, {
  x <- rpois(n = 10, lambda)
  y <- rpois(n = 10, lambda)
  chisq.test(cbind(x, y))$p.value
})
mean(t < 0.05) # Should be close to 0.05
[1] 0.052
hist(t) # Should look roughly uniform

Note that when \(\lambda\) is small (e.g. when a cell in the contingency table has a value <5), Fisher’s exact test should be used instead. This is particularly relevant when comparing posterior chains, as it is possible for a pair, such as A->B, to appear a few times in one chain but never (0) in the other.
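For instance, consider a hypothetical 2x2 table of ancestry-pair counts (made-up numbers, not drawn from the chains above) where a pair occurs a few times in one chain of 50 trees but never in the other:

```r
# Hypothetical counts: "A->B" appears 3 times in one chain, 0 times in the other
tab <- cbind(chain1 = c(3, 47), chain2 = c(0, 50))
rownames(tab) <- c("A->B", "other")

# The chi-square approximation is unreliable when expected cell counts are < 5
# (chisq.test() warns about this)...
suppressWarnings(chisq.test(tab)$p.value)
# ...whereas Fisher's exact test does not rely on that approximation.
fisher.test(tab)$p.value
```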

To assess the sensitivity of the test to sample size, we vary the number of trees sampled from the posterior distribution (chain) in each replication.

sample_sizes <- c(10, 50, 100, 500)
set.seed(111)
results <- lapply(sample_sizes, function(n) {
  test <- replicate(100, {
    get_chisq(
      sample(chain, size = n, replace = TRUE),
      sample(chain, size = n, replace = TRUE)
      #test_args = list(simulate.p.value = TRUE, B = 1000)
    )$p.value
  })
})

lapply(results, function(p_vals){
  mean(p_vals < 0.05) # Should be close to 0.05
})
[[1]]
[1] 0

[[2]]
[1] 0

[[3]]
[1] 0

[[4]]
[1] 0

We don’t observe the expected 5% of false positives, most likely because the chi-square test is applied to tables with low expected cell counts. Fisher’s test is computationally intensive, so we won’t try it out here, but you can specify method = "fisher" in the function epitree::get_chisq().

PERMANOVA

Same methodology, using vegan::adonis2(). However, we only run one replicate, as the test takes a long time to run.
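For reference, here is what a minimal adonis2() call looks like on made-up data. The distance matrix and grouping below are placeholders for illustration only; they are not what compare_chains() computes internally:

```r
library(vegan)

# Toy PERMANOVA sketch: 20 "trees" described by 2 made-up features
set.seed(1)
coords <- matrix(runif(40), nrow = 20)
d <- dist(coords)                                  # pairwise distances
grp <- data.frame(chain = rep(c("A", "B"), each = 10))

# adonis2() partitions the distance matrix by group and permutes labels
# to obtain Pr(>F), the p-value reported below.
adonis2(d ~ chain, data = grp, permutations = 199)$`Pr(>F)`[1]
```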

sample_sizes <- c(10, 50, 100, 500)
results <- lapply(sample_sizes, function(n) {
  test <- replicate(1, {
    compare_chains(sample(chain, size = n, replace = TRUE),
                   sample(chain, size = n, replace = TRUE))$`Pr(>F)`[1]
  })
  
})
results
[[1]]
[1] 0.625

[[2]]
[1] 0.268

[[3]]
[1] 0.477

[[4]]
[1] 0.363

Type II error (False Negative)

A Type II error, or false negative, occurs when the null hypothesis is accepted despite being false. In this context, it means concluding that two (or more) posterior chains are identical when they are, in fact, different.

To assess a Type II error, we require at least two distinct outbreak scenarios: for example, one scenario involving super-spreading (heterogeneous transmission) and another with no over-dispersion (homogeneous transmission), where all individuals have the same reproduction number \(R\).

The package simulacr will help us simulate such outbreaks.

pacman::p_load_gh("CyGei/simulacr")

Super-spreading outbreak (chain_ss)

True transmission tree

We specify an offspring distribution using rnbinom(n = 100, size = 0.2, mu = 3), where the small dispersion parameter (\(k\), or size) indicates high overdispersion. This results in heterogeneous transmission, with substantial variation in the number of secondary cases generated by each infector.
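As a standalone sketch (not part of the analysis), the overdispersion implied by these parameters can be checked directly: a negative binomial has variance \(\mu + \mu^2/k\), so \(k = 0.2\) with \(\mu = 3\) gives a variance of 48 against a mean of 3.

```r
set.seed(1)
offspring_ss  <- rnbinom(n = 1e5, size = 0.2, mu = 3)  # super-spreading
offspring_hom <- rpois(n = 1e5, lambda = 3)            # homogeneous benchmark

c(var_ss = var(offspring_ss), var_hom = var(offspring_hom))
mean(offspring_ss == 0)  # most infectors generate no secondary cases...
max(offspring_ss)        # ...while a few generate very many
```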

set.seed(111)
sim_ss <- 
  simulacr::simulate_outbreak(
  duration = 100,
  population_size = 20,
  R_values = rnbinom(n = 100, size = 0.2, mu = 3), #size is k (dispersion parameter)
  dist_incubation = fake_outbreak$w,
  dist_generation_time = fake_outbreak$w
)$data
epic_ss <- as_epicontacts(sim_ss)
plot(epic_ss)

Outbreak reconstruction

Let’s use sim_ss’s data as input for outbreaker2. The resulting output we’ll use is called chain_ss.

out_ss <- outbreaker(
  data = outbreaker_data(
    dates = sim_ss$date_onset,
    ctd = epic_ss$contacts[-1, ],
    w_dens = fake_outbreak$w,
    ids = as.character(epic_ss$linelist$id)
  ),
  config = create_config(
    init_pi = 1,
    move_pi = FALSE,
    find_import = FALSE
  )
)
plot(out_ss)

out_ss <- out_ss[out_ss$step > 1000, ] # burnin
plot(out_ss, type = "network")
chain <- get_trees(out = out_ss, ids = as.character(epic_ss$linelist$id))
chain_ss <- prep_chain(chain)
invisible(lapply(chain_ss, check_tree))

Check that outbreaker2 did a good job by computing the proportion of correctly assigned ancestries.

o2ools::get_accuracy(chain_ss, epic_ss$contacts[-1, ]) %>% 
  hist(
    main = "Accuracy of outbreak reconstruction",
    xlab = "% of correctly inferred ancestries",
    ylab = "Count"
  )

“Uniform” outbreak (chain_uni)

True transmission tree

Here we provide the same reproduction number \(R\) for everyone. For a fair comparison, I ran a separate script to find a random seed that yields the same number of cases as out_ss, and we’ll randomly assign the case IDs from out_ss to the cases below. The resulting output we’ll use is called chain_uni.

set.seed(3961)
sim_uni <- 
  simulacr::simulate_outbreak(
  duration = 100,
  population_size = 20,
  R_values = 1L,
  dist_incubation = fake_outbreak$w,
  dist_generation_time = fake_outbreak$w
)$data
epic_uni <- as_epicontacts(sim_uni)
plot(epic_uni)

@Thibaut R_values is not respected (some cases have R=2). Is this normal?

Outbreak reconstruction

Let’s use sim_uni’s data as input for outbreaker2.

out_uni <- outbreaker(
  data = outbreaker_data(
    dates = sim_uni$date_onset,
    ctd = epic_uni$contacts[-1, ],
    w_dens = fake_outbreak$w,
    ids = as.character(epic_uni$linelist$id)
  ),
  config = create_config(
    init_pi = 1,
    move_pi = FALSE,
    find_import = FALSE
  )
)
plot(out_uni)

out_uni <- out_uni[out_uni$step > 1000, ] # burnin
plot(out_uni, type = "network")
# assign the IDs from sim_ss
common_ids <- sample(as.character(epic_ss$linelist$id))
chain <- get_trees(out = out_uni, ids = common_ids)
chain_uni <- prep_chain(chain)
invisible(lapply(chain_uni, check_tree))

Comparing two distinct posterior chains

chain_ss refers to the super-spreading outbreak and chain_uni to the “uniform” outbreak.

Chi-square test

sample_sizes <- c(10, 50, 100, 500)
set.seed(111)
results <- lapply(sample_sizes, function(n) {
  test <- replicate(100, {
    get_chisq(
      sample(chain_ss, size = n, replace = TRUE),
      sample(chain_uni, size = n, replace = TRUE)
      #test_args = list(simulate.p.value = TRUE, B = 1000)
    )$p.value
  })
  
  return(mean(test < 0.05))
})

names(results) <- paste("Sample Size:", sample_sizes)

results
$`Sample Size: 10`
[1] 1

$`Sample Size: 50`
[1] 1

$`Sample Size: 100`
[1] 1

$`Sample Size: 500`
[1] 1

PERMANOVA

We only replicate the test once due to running time.

sample_sizes <- c(10, 50, 100, 500)
system.time({
  results <- lapply(sample_sizes, function(n) {
    test <- replicate(1, {
      compare_chains(
        sample(chain_ss, size = n, replace = TRUE),
        sample(chain_uni, size = n, replace = TRUE)
      )$`Pr(>F)`[1]
    })
    
  })
  names(results) <- paste("Sample Size:", sample_sizes)
})
   user  system elapsed 
  22.86    1.92   24.79 
results
$`Sample Size: 10`
[1] 0.001

$`Sample Size: 50`
[1] 0.001

$`Sample Size: 100`
[1] 0.001

$`Sample Size: 500`
[1] 0.001

Sensitivity Analysis

We now want to perform a sensitivity analysis on two posterior chains of transmission trees (chain_A, chain_B) to investigate how the sample size (sample_sizes) and the proportion of overlap between the two chains (overlap_freqs) affect the performance of the test (compare_chains). See run_analysis.R.

source(here::here("analysis/performance/R", "run_analysis.R"))
print(run_analysis)
  1. It first generates a grid of all possible combinations of sample_sizes and overlap_freqs.

  2. For each combination of parameters:

    • Two samples are created:

      • reference: A sample of size n from chainA only.

      • mixed: A sample of size n combining chainA and chainB. It draws from chainA with probability overlap_freq and from chainB with probability 1 - overlap_freq.

    • The test compares reference with mixed and returns a p-value.

    • This step is repeated n_repeats times.

  3. The function returns a 3D array:

    • The rows represent the iteration id (n_repeat).
    • The columns represent the sample_size.
    • The 3rd dimension represents the overlap frequency between the two chains (overlap_freq).
    • The entries refer to the corresponding p-values.
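The mixing step in 2 can be sketched as follows. The function and argument names here are assumptions for illustration; this is not run_analysis() itself:

```r
# Each of the n draws comes from chainA with probability overlap_freq,
# otherwise from chainB.
mix_sample <- function(chainA, chainB, n, overlap_freq) {
  from_A <- runif(n) < overlap_freq
  c(sample(chainA, sum(from_A), replace = TRUE),
    sample(chainB, sum(!from_A), replace = TRUE))
}

set.seed(1)
mixed <- mix_sample(chainA = as.list(letters[1:5]),
                    chainB = as.list(LETTERS[1:5]),
                    n = 10, overlap_freq = 0.7)
length(mixed)  # always n
```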
result <- run_analysis(
    chainA = chain_uni,
    chainB = chain_ss,
    sample_sizes = c(50, 200, 1000),
    overlap_freqs = seq(0, 1, 0.1), #generate_sequence(0, 1, 0.1)
    n_repeats = 1000
  )
# Took 51 hours to run!!!!! @Thibaut

Let’s read the results from the simulation.

run_date <- "2025-01-08"
result <- readRDS(here::here("analysis/performance/data", run_date, "result.rds")) %>% reshape2::melt()
str(result)
'data.frame':   33000 obs. of  4 variables:
 $ n_repeat    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ sample_size : int  50 50 50 50 50 50 50 50 50 50 ...
 $ overlap_freq: num  0 0 0 0 0 0 0 0 0 0 ...
 $ value       : num  0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 ...

The plot shows the distribution of p-values comparing two chains of posterior transmission trees over 1000 iterations.

  • With 50 trees per chain, the test detects a significant difference unless over 70% of the samples are from a shared distribution.

  • For chains with 200 trees, this threshold rises to 90%.

  • For 1000 trees, the test accepts the null hypothesis of no difference only when the chains are entirely drawn from the same posterior distribution.

Overall, as sample size increases, the test becomes more sensitive to differences between the chains, requiring a higher proportion of shared samples to accept the null hypothesis of no difference.

Alternatively, we can look at the frequency of p-values that fall below versus at or above 0.05. Again, for two identical chains we expect on average 5% of false positives.

TO DISCUSS

We will be computing the following (note that FPR = 1 - TNR and FNR = 1 - TPR):

| Metric | Definition | overlap_freq = 1 (chains are identical) | overlap_freq < 1 (partial overlap) | overlap_freq = 0 (chains are distinct) |
|---|---|---|---|---|
| False Positive Rate (FPR) (type I error) | Proportion of tests incorrectly rejecting the null hypothesis (detecting a difference when there is none). | Should be close to 5% (at α = 0.05) | Unclear | NA |
| True Positive Rate (TPR) (sensitivity) | Proportion of tests correctly rejecting the null hypothesis (detecting a difference when there is one). | NA | Unclear | Should be close to 100% |
| True Negative Rate (TNR) (specificity) | Proportion of tests correctly accepting the null hypothesis (detecting no difference when there is none). | Should be 95% | Unclear | NA |
| False Negative Rate (FNR) (type II error) | Proportion of tests incorrectly accepting the null hypothesis (detecting no difference when there is one). | NA | Unclear | Should be close to 0% |
| Accuracy | Proportion of correct results (true positives + true negatives). | Close to 95% | Unclear | Should be close to 100% |
| Optimal Sample Size / Power | Minimum n needed to achieve the desired TPR. | | | |

The goal is to:

  • Compare test performance across different sample sizes

  • Evaluate how the “discrimination” ability changes with overlap_freq

  • Identify optimal significance level for balancing sensitivity and specificity

ROC

  • Consider overlap_freq = 0 as our “positive” case (chains are truly different)
  • Consider overlap_freq = 1 as our “negative” case (chains are identical)
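With that labelling, an ROC curve can be sketched by sweeping the significance threshold (again with simulated p-values standing in for the real result array):

```r
set.seed(1)
p_pos <- rbeta(1000, 0.1, 10)  # overlap_freq = 0 ("positive": truly different)
p_neg <- runif(1000)           # overlap_freq = 1 ("negative": identical)

# For each threshold, the FPR/TPR pair is the proportion of p-values below it
alphas <- seq(0, 1, by = 0.01)
fpr <- sapply(alphas, function(a) mean(p_neg < a))
tpr <- sapply(alphas, function(a) mean(p_pos < a))

plot(fpr, tpr, type = "l", xlab = "FPR", ylab = "TPR")
abline(0, 1, lty = 2)  # chance line
```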